-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Fix iterating for best centroid when algorithm is neighbour aware and decrease SAMPLES_PER_CLUSTER_DEFAULT #130069
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Pinging @elastic/es-search-relevance (Team:Search Relevance) |
| static final int MAXK = 128; | ||
| static final int MAX_ITERATIONS_DEFAULT = 6; | ||
| static final int SAMPLES_PER_CLUSTER_DEFAULT = 256; | ||
| static final int SAMPLES_PER_CLUSTER_DEFAULT = 64; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be worth including some of the runs you were doing in the PR comments just so we can look back at them if we need to to confirm recall wasn't hurt by doing this
I'll run a couple runs myself here real quick too to double check with a different model
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ran this whole PR and just the sampling change only on glove 200d 1m, 3m and 10m and for both saw no major drops in recall
| * @param sampleSize the subset of vectors to use when shifting centroids | ||
| * @param maxIterations the max iterations to shift centroids | ||
| */ | ||
| public static void cluster(FloatVectorValues vectors, float[][] centroids, int sampleSize, int maxIterations) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is only used for tests and kinda silly now you can just get rid of this or I can clean it up in a subsequent PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's clean up in a follow up PR
john-wagster
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
… decrease SAMPLES_PER_CLUSTER_DEFAULT (elastic#130069) * KMeansIntermediate shares assigments
The current loop is not doing the right thing when there are neighbours. We also decrease SAMPLES_PER_CLUSTER_DEFAULT as it is too high.